Wrangling data in R

Leonard Blaschek

“80% of data science is cleaning”

— Ancient proverb

A quick word about myself

Types of data

  1. Excel sheets
  2. Delimited text files
  3. Folders of raw data
  4. Insane, lawless text files
  5. Proprietary formats

Tidy Data

  • No white space
  • One observation per row
  • One variable per column
  • No information in formatting
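
A minimal sketch of a table that follows these rules. The measurements and column names (plant_id, genotype, day, height_cm, outlier) are invented for illustration.

```r
library(tibble)

# Hypothetical measurements: every row is one observation (one plant on one day),
# every column is one variable, and the names contain no white space
tidy <- tribble(
  ~plant_id, ~genotype, ~day, ~height_cm,
  "p01",     "WT",      1,    2.1,
  "p01",     "WT",      7,    5.4,
  "p02",     "mutant",  1,    1.8,
  "p02",     "mutant",  7,    4.9
)

# Information that might otherwise live in cell colours or bold text
# gets its own column instead
tidy$outlier <- c(FALSE, FALSE, FALSE, TRUE)
```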



Spot the untidiness!

R fundamentals

ggplot()            # function
ggplot              # object
996107              # number
"ggplot"            # string
?ggplot()           # show help page 
library(tidyverse)  # use library() to load the tidyverse package

Package vignettes

?readr # navigate to package index and then vignettes

Function help pages

?read_tsv()

Arguments without a default value must be supplied.
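
A small sketch of what that means for read_tsv(): `file` has no default, `col_names` does. The inline data is made up, and I() (literal data instead of a file path) assumes readr 2.0 or later.

```r
library(readr)

# `file` has no default, so it has to be supplied.
# I() marks the string as literal data rather than a file path.
tab <- read_tsv(I("sample\tvalue\nA\t1\nB\t2\n"), show_col_types = FALSE)

# `col_names` defaults to TRUE (first row = header);
# it only needs to be set when the default does not fit
no_header <- read_tsv(
  I("A\t1\nB\t2\n"),
  col_names = c("sample", "value"),
  show_col_types = FALSE
)
```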

The Pipe: Using function output as input





library(readr)
nrow(read_csv("data/cleaned_example.csv"))
#> [1] 16
"data/cleaned_example.csv" |> 
  read_csv() |> 
  nrow()
#> [1] 16

The pipe is typed as either %>% (from magrittr, loaded with the tidyverse) or |> (built into R since version 4.1)
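
A self-contained illustration with base R only, assuming R 4.1 or later for |>. The numbers are arbitrary.

```r
# Nested calls are read from the inside out ...
nested <- round(mean(sqrt(c(1, 4, 9, 16))), digits = 1)

# ... while the piped version reads top to bottom.
# Each result becomes the first argument of the next function.
piped <- c(1, 4, 9, 16) |>
  sqrt() |>
  mean() |>
  round(digits = 1)
```

Both versions compute the same value; the pipe only changes how the code reads.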

Delimited text files

readr::read_csv()          #Comma delimited, decimal points
readr::read_csv2()         #Semicolon delimited, decimal comma

readr::read_tsv()          #Tab delimited

readr::read_delim()        #Pick your own delimiter

readxl::read_xlsx()        #Excel files
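
A rough sketch of two of these readers on inline example data (invented values). I() for literal data assumes readr 2.0 or later.

```r
library(readr)

# Custom delimiter: values separated by "|"
pipe_delimited <- "sample|value\nA|1.5\nB|2.5\n"
dat <- read_delim(I(pipe_delimited), delim = "|", show_col_types = FALSE)

# Semicolon delimited with decimal commas, as exported by many European Excel setups
semicolon <- "sample;value\nA;1,5\nB;2,5\n"
dat2 <- read_csv2(I(semicolon), show_col_types = FALSE)
```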

Folders of files

data_files <- list.files(
  path = "/path/to/folder",     #Folder containing data files
  pattern = "\\.csv$",          #Regular expression matching files of interest
  recursive = TRUE,             #Look in sub-folders
  full.names = TRUE             #Return full file path
)

library(purrr)
data <- map(
  data_files,
  \(x) read_csv(x, id = "path") #Add column containing the file path
) |> 
  list_rbind()                  #Stack the list of tables into one
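
To try this without real data, here is a sketch that first writes two small CSV files into a temporary folder. The folder name, file names and values are made up; list_rbind() assumes purrr 1.0 or later.

```r
library(readr)
library(purrr)

# Set up a throwaway folder with two small CSV files (illustrative data)
folder <- file.path(tempdir(), "wrangling_demo")
dir.create(file.path(folder, "batch_2"), recursive = TRUE, showWarnings = FALSE)
write_csv(
  data.frame(sample = c("A", "B"), value = c(1, 2)),
  file.path(folder, "batch_1.csv")
)
write_csv(
  data.frame(sample = "C", value = 3),
  file.path(folder, "batch_2", "run.csv")
)

# Find all CSV files, including those in sub-folders
data_files <- list.files(
  folder,
  pattern = "\\.csv$",
  recursive = TRUE,
  full.names = TRUE
)

# Read each file, record where it came from, and stack the results
data <- map(
  data_files,
  \(x) read_csv(x, id = "path", show_col_types = FALSE)
) |>
  list_rbind()
```

The path column keeps track of which file each row came from, which is often the only place where batch or sample information is stored.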

Data cleaning

Missing values
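
A short sketch of common ways to find and handle NA, using a made-up table. Which option is appropriate depends on the data.

```r
library(tibble)
library(tidyr)

df <- tibble(sample = c("A", "B", "C"), value = c(1.2, NA, 3.4))

n_missing <- sum(is.na(df$value))          # count missing values
mean_value <- mean(df$value, na.rm = TRUE) # ignore them in a calculation
complete <- drop_na(df, value)             # drop rows where value is missing
filled <- replace_na(df, list(value = 0))  # replace them (rarely sensible for measurements)
```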

Long and wide data
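
One possible way to move between the two shapes with tidyr, sketched on invented plant heights. The column names are assumptions for the example.

```r
library(tibble)
library(tidyr)

# Wide: the variable "day" is hidden in the column names
wide <- tribble(
  ~plant, ~day_1, ~day_7,
  "A",    2.1,    5.4,
  "B",    1.8,    4.9
)

# Long: one row per plant and day
long <- pivot_longer(
  wide,
  cols = starts_with("day_"),
  names_to = "day",
  names_prefix = "day_",    # strip the prefix from the new "day" values
  values_to = "height_cm"
)

# And back to wide again
wide_again <- pivot_wider(
  long,
  names_from = day,
  values_from = height_cm,
  names_prefix = "day_"
)
```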

Separating compound variables
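
A sketch of splitting one column that encodes two variables, with made-up sample names. separate_wider_delim() assumes tidyr 1.3 or later; older versions offer separate() for the same job.

```r
library(tibble)
library(tidyr)

# "sample" mixes genotype and replicate number
df <- tibble(
  sample = c("WT_1", "WT_2", "mutant_1"),
  value = c(1.2, 1.4, 0.7)
)

separated <- separate_wider_delim(
  df,
  cols = sample,
  delim = "_",
  names = c("genotype", "replicate")
)
```

The new columns are character vectors, so replicate still needs converting if it should be a number.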

Correcting data classes
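
A minimal sketch of converting columns that were read as text into more useful classes. The table and the factor level order are invented for the example.

```r
library(tibble)
library(dplyr)
library(readr)

df <- tibble(
  genotype = c("WT", "mutant", "WT"),
  replicate = c("1", "2", "3"),
  height = c("2.1 cm", "1.8 cm", "2.4 cm")
)

typed <- df |>
  mutate(
    genotype = factor(genotype, levels = c("WT", "mutant")), # fixed order for plots
    replicate = as.integer(replicate),                       # text to whole number
    height = parse_number(height)                            # drop the unit, keep the number
  )
```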

Data analysis

Grouping
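
A sketch of what grouping does on its own, using invented data: the rows stay the same, but later operations run once per group.

```r
library(tibble)
library(dplyr)

df <- tibble(
  genotype = c("WT", "WT", "mutant", "mutant"),
  height_cm = c(2.0, 2.4, 1.0, 1.4)
)

grouped <- group_by(df, genotype)  # same rows, now with grouping information
groups_found <- n_groups(grouped)  # number of distinct groups
group_counts <- count(grouped)     # rows per group
ungrouped <- ungroup(grouped)      # remove the grouping again
```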

Mutate
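
A short sketch of mutate() adding columns, once on the whole table and once per group. The data and column names are invented.

```r
library(tibble)
library(dplyr)

df <- tibble(
  genotype = c("WT", "WT", "mutant", "mutant"),
  height_cm = c(2.0, 2.4, 1.0, 1.4)
)

result <- df |>
  mutate(height_mm = height_cm * 10) |>             # new column from an existing one
  group_by(genotype) |>
  mutate(relative = height_cm / mean(height_cm)) |> # mean is calculated per genotype
  ungroup()
```

mutate() keeps the number of rows unchanged; it only adds or overwrites columns.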

Summarise
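
A sketch of collapsing each group into a single row of summary statistics, again on made-up data.

```r
library(tibble)
library(dplyr)

df <- tibble(
  genotype = c("WT", "WT", "mutant", "mutant"),
  height_cm = c(2.0, 2.4, 1.0, 1.4)
)

summary_tbl <- df |>
  group_by(genotype) |>
  summarise(
    n = n(),                       # observations per group
    mean_height = mean(height_cm),
    sd_height = sd(height_cm)
  )
```

Unlike mutate(), summarise() returns one row per group.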

purrr
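
A small sketch of the map family applying a function to every element of a list. The list is invented, and the \(x) shorthand assumes R 4.1 or later.

```r
library(purrr)

samples <- list(WT = c(2.0, 2.4), mutant = c(1.0, 1.4, 1.2))

means <- map_dbl(samples, mean)   # one number per element, names are kept
sizes <- map_int(samples, length) # one integer per element

# imap_*() also passes the element name to the function
labels <- imap_chr(samples, \(x, name) paste0(name, ": n = ", length(x)))
```

The suffix (_dbl, _int, _chr) states the type of the result; plain map() always returns a list.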

When you’re stuck

  1. Know which package/function you need? — Help pages and vignettes!
  2. Know what you want to do but not where to start? — Try an LLM, e.g. perplexity.ai
  3. I feel like I’ve done this before … — Keep your old scripts organised and annotated. Chances are you’ll need that little hack you came up with again in a month or two.

Exercises!

Open up 2023_ggplot2_exercises.rmd and give it a try

Resources to go further